feat(fts): thread DataFusion MemoryPool through inverted index build #7314

Draft
wjones127 wants to merge 1 commit into lance-format:main from wjones127:worktree-inherited-greeting-marshmallow
@wjones127

Contributor

Previously the FTS builder used a bespoke per-worker memory watermark
(LANCE_FTS_PARTITION_SIZE env var, default 2 GiB) to decide when to flush a
posting-list partition. Each worker tracked usage independently and flushed when it
crossed its private limit.

Changes

  • Replace worker_memory_limit_bytes in IndexWorkerConfig / IndexWorker with a
    MemoryReservation drawn from a shared FairSpillPool.
  • Create one FairSpillPool per build, sized to the total memory budget
    (memory_limit_mb × 1 MiB, or LANCE_FTS_PARTITION_SIZE × number of workers as
    the default), and pass it into every IndexWorker via IndexWorkerConfig.
  • After each document, sync the reservation with try_grow / shrink. When
    try_grow returns Err, the pool is exhausted, so the worker flushes the current
    partition and frees the reservation.
  • Guard against a single document that is larger than the entire pool by returning an
    error when try_grow fails on an otherwise-empty builder.
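
The sync-then-flush behaviour in the list above can be sketched with a std-only toy model. This is not the PR's code and not DataFusion's API: `ToyPool`, `ToyReservation`, `ToyWorker`, and `demo_flush_count` are hypothetical stand-ins, the pool is a plain byte counter rather than a `FairSpillPool` (so it does no fair division between consumers), and document sizes are made-up byte counts.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Toy shared pool: a byte capacity plus a running total. NOT DataFusion's type.
struct ToyPool {
    capacity: usize,
    used: AtomicUsize,
}

/// Toy per-worker reservation drawn from the shared pool.
struct ToyReservation {
    pool: Arc<ToyPool>,
    size: usize,
}

impl ToyReservation {
    /// Reserve `bytes` more, failing if the pool would exceed its capacity.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let cap = self.pool.capacity;
        self.pool
            .used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= cap)
            })
            .map_err(|used| format!("pool exhausted: {used} of {cap} bytes in use"))?;
        self.size += bytes;
        Ok(())
    }

    /// Return `bytes` to the pool (clamped to what this reservation holds).
    fn shrink(&mut self, bytes: usize) {
        let bytes = bytes.min(self.size);
        self.pool.used.fetch_sub(bytes, Ordering::SeqCst);
        self.size -= bytes;
    }

    /// Release everything this reservation holds.
    fn free(&mut self) {
        self.shrink(self.size);
    }
}

/// Hypothetical worker: tracks buffered bytes and how many partitions it flushed.
struct ToyWorker {
    reservation: ToyReservation,
    buffered_bytes: usize,
    flushed_partitions: usize,
}

impl ToyWorker {
    fn new(pool: &Arc<ToyPool>) -> Self {
        let reservation = ToyReservation { pool: Arc::clone(pool), size: 0 };
        Self { reservation, buffered_bytes: 0, flushed_partitions: 0 }
    }

    /// Buffer one document, then sync the reservation to the builder's size.
    fn add_document(&mut self, doc_bytes: usize) -> Result<(), String> {
        let was_empty = self.buffered_bytes == 0;
        self.buffered_bytes += doc_bytes;
        let held = self.reservation.size;
        let synced = if self.buffered_bytes >= held {
            self.reservation.try_grow(self.buffered_bytes - held)
        } else {
            self.reservation.shrink(held - self.buffered_bytes);
            Ok(())
        };
        match synced {
            Ok(()) => Ok(()),
            // Nothing else to flush: this one document cannot fit.
            Err(e) if was_empty => {
                self.buffered_bytes = 0;
                Err(format!("document does not fit in the pool: {e}"))
            }
            // Pool exhausted: flush the partition and free the reservation.
            Err(_) => {
                self.reservation.free();
                self.buffered_bytes = 0;
                self.flushed_partitions += 1;
                Ok(())
            }
        }
    }
}

/// Feed ten 300-byte documents through a worker backed by a 1000-byte pool.
fn demo_flush_count() -> usize {
    let pool = Arc::new(ToyPool { capacity: 1000, used: AtomicUsize::new(0) });
    let mut worker = ToyWorker::new(&pool);
    for _ in 0..10 {
        worker.add_document(300).expect("each document fits on its own");
    }
    worker.flushed_partitions
}
```

In this toy run, every fourth document pushes the worker past 1000 bytes, so `demo_flush_count()` reports 2 flushed partitions. A single 300-byte document against a 100-byte pool returns an error instead of flushing, which mirrors the oversized-document guard.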

Test plan

  • All 25 existing builder unit tests still pass.
  • New test_memory_pool_spills_on_tight_budget: constructs an IndexWorker with a
    72 KiB FairSpillPool, processes 20 docs each contributing 300 unique tokens, and
    asserts that at least one completed partition was written (proving pool-triggered
    spill occurred).

Closes #7304.

Previously the FTS builder used a bespoke per-worker memory watermark
(`LANCE_FTS_PARTITION_SIZE` env var, default 2 GiB) to decide when to flush
a posting-list partition. Each worker tracked usage independently and flushed
when it crossed its private limit.

This replaces that mechanism with a DataFusion `FairSpillPool` shared across
all workers for a given build. Each worker holds a `MemoryReservation` and
calls `try_grow` after each document; when the pool is exhausted, `try_grow`
returns `Err`, which triggers a flush and keeps total build memory within the
user-configured budget.
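
The budget itself follows the sizing rule from the PR description: `memory_limit_mb` × 1 MiB when set, otherwise the per-worker partition size × the number of workers. A minimal sketch of that arithmetic follows. `total_budget_bytes` and its parameters are hypothetical names, and it assumes the partition size has already been converted to bytes, since the unit of `LANCE_FTS_PARTITION_SIZE` is not stated here.

```rust
const MIB: u64 = 1024 * 1024;

/// Hypothetical helper: total pool size in bytes for one index build.
/// Returns `None` if the multiplication would overflow `u64`.
fn total_budget_bytes(
    memory_limit_mb: Option<u64>,
    partition_size_bytes: u64, // assumed to be in bytes already
    num_workers: u64,
) -> Option<u64> {
    match memory_limit_mb {
        // Explicit user budget: megabytes to bytes.
        Some(mb) => mb.checked_mul(MIB),
        // Default: the old per-worker limit, summed over all workers.
        None => partition_size_bytes.checked_mul(num_workers),
    }
}
```

With the 2 GiB default and 8 workers, this gives a 16 GiB pool, the same total as the old per-worker watermarks but now shared rather than split into private limits.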

Fixes lance-format#7304.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (bot) added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 17, 2026

Development

Successfully merging this pull request may close these issues.

Thread MemoryPool through FTS / inverted index build